A Hybrid Approximate XML Subtree Matching Method Using Syntactic Features and Word Semantics

Authors

  • Wenxin LIANG
  • Haruo YOKOTA
Abstract

With the exponential increase in the amount and size of XML data on the Internet, XML subtree matching has become important in many application areas, such as change detection, keyword retrieval, and knowledge discovery over XML documents. In our previous work, we proposed leaf-clustering based approximate XML subtree matching methods that use syntactic information from both the clustered leaf nodes and the corresponding paths. In this paper, we propose a hybrid subtree matching method in which a match is determined by the word semantics of the leaf nodes, based on the WordNet thesaurus, together with the syntactic features of the relevant paths. We also propose a one-pass hash join technique to reduce the additional join cost caused by the extra words introduced by the WordNet expansion. We perform experiments to evaluate execution performance and matching precision and recall, comparing the hybrid method with the original syntax-based methods. The experimental results indicate that, compared with the existing path-based SLAX algorithm, the proposed hybrid method with one-pass hash join effectively improves precision and recall with only about a 5% increase in execution time for leaf-clustering based subtree matching.
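
The abstract does not give the details of the one-pass hash join, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes that leaf nodes are given as (subtree id, word) pairs, that synonym relations are symmetric, and that expanding only the build side of the join is acceptable. The names `TOY_SYNONYMS`, `expand`, and `one_pass_hash_join` are hypothetical, and the small dictionary stands in for real WordNet lookups.

```python
from collections import defaultdict

# Hypothetical toy thesaurus standing in for WordNet synonym lookups.
TOY_SYNONYMS = {
    "car": {"automobile", "auto"},
    "automobile": {"car", "auto"},
    "auto": {"car", "automobile"},
    "writer": {"author"},
    "author": {"writer"},
}


def expand(word, synonyms=TOY_SYNONYMS):
    """Return the lower-cased word together with its known synonyms."""
    w = word.lower()
    return {w} | synonyms.get(w, set())


def one_pass_hash_join(base_leaves, target_leaves, synonyms=TOY_SYNONYMS):
    """Count semantically matching leaf words per pair of subtrees.

    base_leaves and target_leaves are iterables of (subtree_id, word).
    Returns {(base_subtree_id, target_subtree_id): matched_leaf_count}.
    """
    # Build phase: hash every expanded base word to the base subtrees
    # that contain it. The expansion cost is paid once, on this side only.
    table = defaultdict(set)
    for base_id, word in base_leaves:
        for w in expand(word, synonyms):
            table[w].add(base_id)

    # Probe phase: scan the target leaves exactly once. Each leaf needs a
    # single lookup, because the synonyms are already in the hash table.
    counts = defaultdict(int)
    for target_id, word in target_leaves:
        for base_id in table.get(word.lower(), ()):
            counts[(base_id, target_id)] += 1
    return dict(counts)
```

In this sketch, the per-pair counts could then be combined with a path-based syntactic score to decide whether two subtrees match; how the two signals are actually weighted is not stated in the abstract.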

Similar resources

Efficient Processing of XML Tree Pattern Queries

In this paper, we present a polynomial-time algorithm for TPQ (tree pattern queries) minimization without XML constraints involved. The main idea of the algorithm is a dynamic programming strategy to find all the matching subtrees within a TPQ. A matching subtree implies a redundancy and should be removed in such a way that the semantics of the original TPQ is not damaged. Our algorithm consist...

Semantics of haq in the Glorious Quran

Meaning plays a very important role at all levels of linguistic analysis and in linguistics. We can say that the word itself and out of the chain of speech doesn’t show the true meaning. It should be in relation with other signs within the language that its meaning be relived. Quran, the precious word of Allah, contains words that take a variety of meanings in the syntactic and topical con...

TASM: Top-k Approximate Subtree Matching

We consider the Top-k Approximate Subtree Matching (TASM) problem: finding the k best matches of a small query tree, e.g., a DBLP article with 15 nodes, in a large document tree, e.g., DBLP with 26M nodes, using the canonical tree edit distance as a similarity measure between subtrees. Evaluating the tree edit distance for large XML trees is difficult: the best known algorithms have cubic runti...

Exploring Syntactic Structural Features for Sub-Tree Alignment Using Bilingual Tree Kernels

We propose Bilingual Tree Kernels (BTKs) to capture the structural similarities across a pair of syntactic translational equivalences and apply BTKs to sub-tree alignment along with some plain features. Our study reveals that the structural features embedded in a bilingual parse tree pair are very effective for sub-tree alignment and the bilingual tree kernels can well capture such features. Th...

Publication date: 2009